Efficiency & Ultra Threading:
For ATI's first Shader Model 3.0 architecture, the firm's engineers believe that the key parts of the Shader Model 3.0 specification have been the primary focus. The focus of the architecture is clear, with both efficiency and the ability to handle multiple threads of data at once having a big influence on the way that things have been designed.
The architecture's efficiency is focused on, but not limited to, their ability to accelerate a Shader Model 3.0 feature known as Dynamic Branching, a key programming technique. In order to achieve this acceleration and efficiency, ATI has implemented a multi-threaded core design - the heart of which is the Ultra-Threaded Dispatch Processor. This makes it easy to handle a large number of threads of all sizes.
The Ultra-Threaded Central Dispatch Unit (UTCDU) is able to handle up to 512 threads across an array of pixel processors. The ability to work on multiple threads in the dedicated dispatch unit means that large portions of shader code can be skipped for certain pixels where it is deemed to be unnecessary to run
waste code on pixels that don't require it. Essentially, it breaks the pixel processing workload required to render an image into a large number of smaller tasks. The threads consist of small 4x4 blocks of pixels that are coupled together because they require the same shader code to be executed on them.
Pixel & Vertex Shaders:
Pixel Shader: ATI has chosen to decouple the processing units inside R520, making comparisons to other architectures increasingly difficult. There are still the conventional pixel shader and vertex shader units, and also the rasteriser (labelled the Render Back-End in the diagram on the previous page), but the texture units have been moved away from their traditional home attached to each pixel shader engine. Despite this, there have been few changes to the guts of each pixel shader, it still has the same processing ability with each pixel shader being capable of working on up to six pixel shader instructions per clock in the form of two scalar and two three-component vector instructions, along with a single flow control instruction at the same time.
ATI have focused on another key point that is required for improving pixel shader efficiency: the ability hide latency and avoid wasted clock cycles and pipeline stalls. They have worked around this efficiency problem by addressing the way that textures are fetched and stored.
As such, the texture address units and samplers have been made a little more flexible. R520 still has a 1:1 ratio of texture units and pixel pipelines, as is seen in ATI's previous architectures, but the two components work in blocks of four via the UTCDU. This means that any of the four pixel shaders inside each pixel shader core can be allocated to any one of the four texture units, rather than it being one texture unit per pipeline.
Vertex Shader: We mentioned that R520 has eight vertex shader units with each one capable of Vertex Shader 3.0 and Dynamic Branching. One item that was left out of the vertex shader on the X1000 series was the capability of vertex texturing. X1800 shows the capability for Vertex Texturing, but there are no texture formats exposed - meaning that it is impossible to do a vertex texture fetch on ATI's current hardware. We spoke to ATI about this, and we were given the following information:
"Very simply, the ability to do a vertex texture fetch is not compulsory for SM3 - it is an optional feature. We are strongly recommending that developers go with render to vertex buffer because this gives them the solution that they have been looking for all along which is the ability to move output from the pixel engine to the vertex engine."
There is a loop hole in the Vertex Shader 3.0 specification, which states that the vertex shader only needs to show the capability for Vertex Texturing and isn't actually required expose any texture formats that can be used for Vertex Texture Fetch. This means that the chip is Vertex Shader 3.0 compliant, despite lacking full vertex texture fetching capabilities.
Basically, we feel that the non-exposure of Vertex Texturing texture formats has saved ATI a lot of die space. ATI believe that it is impossible to use the feature without huge performance hits due to latency, or a huge real estate taken up by texture caches on the die in order to combat the latency issue. Quite clever when you think about it, seeing as there's an alternative way of getting around this by using the render to vertex buffer shader instruction. However, since this feature is usable on NVIDIA hardware, ATI may find this to be a problem in the future.
Want to comment? Please log in.